Abstract

The subword complexity of a finite word $w$ of length $N$ is a function
which associates to each $n \le N$ the number of all distinct subwords of
$w$ having the length $n$. We define the maximal complexity $C(w)$ as the
maximum of the subword complexity for $n \in \{1, 2, \ldots, N\}$, and the
global maximal complexity $K(N)$ as the maximum of $C(w)$ for all words $w$
of a fixed length $N$ over a finite alphabet. By $R(N)$ we will denote the
set of the values $i$ for which there exists a word of length $N$ having
$K(N)$ subwords of length $i$. $M(N)$ represents the number of words of
length $N$ whose maximal complexity is equal to the global maximal
complexity.

The values of $K(N)$ and $R(N)$ are obtained; methods to compute $M(N)$
using the de Bruijn graphs and trees are given. An open problem is to
find a formula for $M(N)$.



1 Introduction 

A finite word is a finite sequence of letters over a finite alphabet $A$, and
can be represented as a concatenation of its letters:

\[ w = w_1 w_2 \cdots w_N \quad \text{with } w_i \in A \text{ for } 1 \le i \le N. \]

The number $N$ is the length of $w$ and is denoted by $|w|$. A word with
no letters (i.e. of length 0) is the empty word, denoted by $\varepsilon$. We
denote by $A^+$ the set of nonempty words over $A$, by
$A^* = A^+ \cup \{\varepsilon\}$ the set of words over $A$ and by $A^n$ the set
of words of length $n$ over $A$.

A word $u$ is a factor (or subword) of $w$ if there exist words $x, y \in A^*$
such that $w = xuy$. If $x \ne \varepsilon$ and $y \ne \varepsilon$ then $u$ is a
proper factor (proper subword) of $w$. If $x = \varepsilon$ ($y = \varepsilon$)
then $u$ is a prefix (suffix) of $w$. Let us denote by $F(w)$ the set of all
nonempty factors of $w$, and by $F_n(w)$ the set of all factors of $w$ of
length $n$ (hence $F_n(w) = F(w) \cap A^n$).

The subword complexity of w counts the number of all distinct factors 
of a given length occurring in w and is defined as 

\[ f_w(n) = \mathrm{Card}(F_n(w)) \quad \text{for } 1 \le n \le |w|. \]

Clearly $f_w(1) \le \mathrm{Card}(A)$ and we can consider $f_w(n) = 0$ for $n > |w|$.
The subword complexity has been extensively studied in [3], [5] and [8].

The maximal value of the subword complexity $f_w(n)$ for $1 \le n \le |w|$
is called the maximal complexity of $w$ and is denoted by $C(w)$:

\[ C(w) = \max\{ f_w(n) \mid n \ge 1 \}. \]
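
The definitions of $F_n(w)$, $f_w(n)$ and $C(w)$ translate almost literally
into code. The following is a minimal sketch, assuming that words are
represented as Python strings; the function names are chosen here for
illustration only.

```python
def factors(w, n):
    """Return the set F_n(w) of distinct factors of w having length n."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def subword_complexity(w, n):
    """f_w(n) = Card(F_n(w)); the set is empty (value 0) when n > |w|."""
    return len(factors(w, n))


def maximal_complexity(w):
    """C(w): the maximum of f_w(n) over 1 <= n <= |w| (w must be nonempty)."""
    return max(subword_complexity(w, n) for n in range(1, len(w) + 1))
```

For the word 001 this gives the complexities 2, 2, 1 for $n = 1, 2, 3$, so
its maximal complexity is 2.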

The global maximal complexity in $A^N$ is equal to

\[ K(N) = \max\{ C(w) \mid w \in A^N \}. \]

We shall denote by $R(N)$ the set of values $i$ for which there exists a
word $w \in A^N$ such that $f_w(i) = K(N)$:

\[ R(N) = \{ i \in \{1, 2, \ldots, N\} \mid \exists\, w \in A^N : f_w(i) = K(N) \}. \]

The number of words in $A^N$ with the maximal complexity equal to
the global maximal complexity will be denoted by $M(N)$:

\[ M(N) = \mathrm{Card}(\{ w \in A^N : C(w) = K(N) \}). \]
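
For small $N$ the three quantities can be obtained by exhaustive
enumeration of $A^N$. The sketch below is only a brute-force illustration
of the definitions (its running time grows like $q^N$); the alphabet is
passed as a string of distinct characters, which is an encoding assumed
here.

```python
from itertools import product


def complexity_profile(w):
    """Return the list [f_w(1), ..., f_w(|w|)]."""
    length = len(w)
    return [len({w[i:i + n] for i in range(length - n + 1)})
            for n in range(1, length + 1)]


def global_stats(N, alphabet="01"):
    """Brute-force K(N), R(N) and M(N) over all words of length N."""
    profiles = [complexity_profile("".join(t))
                for t in product(alphabet, repeat=N)]
    K = max(max(p) for p in profiles)
    # i belongs to R(N) if some word has exactly K factors of length i
    R = {i + 1 for p in profiles for i, value in enumerate(p) if value == K}
    # M(N) counts the words whose maximal complexity reaches K
    M = sum(1 for p in profiles if max(p) == K)
    return K, R, M
```

For the binary alphabet, `global_stats(3)` returns `(2, {1, 2}, 6)`.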

Remark 1. If $\mathrm{Card}(A) = q$, for $q = 1$ the only word of length $N$ is
the word $w_0$ consisting of $N$ copies of the single letter of $A$, for which
$f_{w_0}(i) = 1$, $i \in \{1, 2, \ldots, N\}$, hence $C(w_0) = 1 = K(N)$,
$R(N) = \{1, 2, \ldots, N\}$ and $M(N) = 1$. For $q \ge 2$, but $N \le q$, for
each word $w_1$ which contains $N$ distinct elements of $A$ we have
$C(w_1) = f_{w_1}(1) = N = K(N)$, $R(N) = \{1\}$ and $M(N) = q!/(q-N)!$ (the
number of permutations of $N$ elements taken from $q$).

Some values for $K(N)$, $R(N)$ and $M(N)$ in the case of an alphabet
of 2 letters are given in Table 1. In the case $N = 3$ the following six
words have maximal complexity: 001, 010, 011, 100, 101, 110. For each
of them $f_w(1) = 2$, $f_w(2) = 2$, $f_w(3) = 1$, so $K(3) = 2$,
$R(3) = \{1, 2\}$ and $M(3) = 6$.

2 Global maximal subword complexity of finite words

In this section we shall compute the values of the global maximal
complexity $K(N)$, as well as those of $R(N)$, proving that they are in
agreement with the values in Table 1. Some special cases being solved in
Remark 1, in what follows we shall consider alphabets with
$\mathrm{Card}(A) = q \ge 2$ and words of length $N > q$.

We shall use the following result.

Lemma 1. [7] For each $k \in \mathbb{N}^*$, the shortest word containing all the
$q^k$ words of length $k$ has $q^k + k - 1$ letters (hence in this word each of
the $q^k$ words of length $k$ appears only once).

An algorithm for obtaining such a word for $A = \{e_1, e_2, \ldots, e_q\}$ is the
following [7]:

i. Each of the first $k - 1$ symbols is equal to $e_1$.

ii. If the sequence $a_1 a_2 \ldots a_{m-k+1} \ldots a_{m-1}$ (with
$a_1 = \ldots = a_{k-1} = e_1$, $m \ge k$ and the $a$'s representing the $e$'s in
a certain order) has been obtained, the symbol $a_m$ to be added is the $e$
with the greatest subscript possible such that $a_{m-k+1} \ldots a_{m-1} a_m$
does not duplicate a previously occurring section of $k$ symbols in the
above sequence.

iii. Rule ii is first applied for $m = k$ (in which case $a_m = a_k = e_q$)
and then applied repeatedly until a further application is impossible.
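
Rules i to iii can be sketched in a few lines. The code below is an
illustrative reading of the rules, not an implementation taken from [7]:
the letters $e_1, \ldots, e_q$ are encoded as the integers $1, \ldots, q$
(an assumption made here), and the set `seen` stores the sections of $k$
symbols that have already occurred.

```python
def martin_word(q, k):
    """Greedy construction following rules i-iii (assumes q >= 1, k >= 1)."""
    word = [1] * (k - 1)                 # rule i: k - 1 copies of e_1
    seen = set()                         # sections of k symbols obtained so far
    while True:
        tail = tuple(word[len(word) - (k - 1):])   # last k - 1 symbols
        for letter in range(q, 0, -1):   # rule ii: greatest subscript first
            section = tail + (letter,)
            if section not in seen:
                seen.add(section)
                word.append(letter)
                break
        else:
            # rule iii: every candidate duplicates an earlier section
            return word
```

For $q = 2$, $k = 3$ the result is 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, that is, the
word $e_1 e_1 e_2 e_2 e_2 e_1 e_2 e_1 e_1 e_1$ of length $2^3 + 3 - 1 = 10$.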

Proposition 1. If $\mathrm{Card}(A) = q$ and $q^k + k \le N \le q^{k+1} + k$ then
$K(N) = N - k$.

Proof. Let us consider at first the case $N = q^{k+1} + k$, $k \ge 1$.
From Lemma 1 we obtain the existence of a word $W$ of length $q^{k+1} + k$
which contains all the $q^{k+1}$ words of length $k + 1$, hence
$f_W(k+1) = q^{k+1}$. It is obvious that $f_W(l) = q^l < f_W(k+1)$ for
$l \in \{1, 2, \ldots, k\}$ and $f_W(k+1+j) = q^{k+1} - j < f_W(k+1)$ for
$j \in \{1, 2, \ldots, q^{k+1} - 1\}$. Any other word of length $q^{k+1} + k$
will have the maximal complexity less than or equal to $C(W) = f_W(k+1)$,
hence we have $K(N) = q^{k+1} = N - k$.

For $k \ge 1$ we consider now the values of $N$ of the form
$N = q^{k+1} + k - r$ with $r \in \{1, 2, \ldots, q^{k+1} - q^k\}$, hence
$q^k + k \le N < q^{k+1} + k$. If from the word $W$ of length $q^{k+1} + k$
considered above we delete the last $r$ letters, we obtain a word $W_N$ of
length $N = q^{k+1} + k - r$ with $r \in \{1, 2, \ldots, q^{k+1} - q^k\}$. This
word will have $f_{W_N}(k+1) = q^{k+1} - r$ and this value will be its
maximal complexity. Indeed, it is obvious that
$f_{W_N}(k+1+j) = f_{W_N}(k+1) - j < f_{W_N}(k+1)$ for
$j \in \{1, 2, \ldots, N-k-1\}$; for $l \in \{1, 2, \ldots, k\}$ it follows that
$f_{W_N}(l) \le q^l \le q^k \le q^{k+1} - r = f_{W_N}(k+1)$,
hence $C(W_N) = f_{W_N}(k+1) = q^{k+1} - r$. Because it is not possible
for a word of length $N = q^{k+1} + k - r$, with
$r \in \{1, 2, \ldots, q^{k+1} - q^k\}$, to have the maximal complexity greater
than $q^{k+1} - r$, it follows that $K(N) = q^{k+1} - r = N - k$.
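
Proposition 1 gives $K(N)$ as soon as the index $k$ of the interval
containing $N$ is known. A small sketch of this computation follows; the
helper names are hypothetical, and the case $k = 0$ (which covers
$N \le q$, in agreement with Remark 1) is included as an assumption of the
sketch.

```python
def interval_index(N, q):
    """Return k >= 0 with q**k + k <= N <= q**(k + 1) + k (N >= 1, q >= 2)."""
    k = 0
    while q ** (k + 1) + k < N:
        k += 1
    return k


def K_formula(N, q):
    """Global maximal complexity K(N) = N - k as stated in Proposition 1."""
    return N - interval_index(N, q)
```

For the binary alphabet this yields $K(3) = 2$, $K(6) = 4$ and $K(10) = 8$.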

Proposition 2. If $\mathrm{Card}(A) = q$ and $q^k + k < N < q^{k+1} + k + 1$ then
$R(N) = \{k + 1\}$; if $N = q^k + k$ then $R(N) = \{k, k + 1\}$.

Proof. In the first part of the proof of Proposition 1 we proved for
$N = q^{k+1} + k$, $k \ge 1$, the existence of a word $W$ of length $N$ for which
$K(N) = f_W(k+1) = N - k$. This means that $k + 1 \in R(N)$. For
the word $W$, as well as for any other word $w$ of length $N$, we have
$f_w(l) < f_W(k+1)$, $l \ne k + 1$, because of the special construction of $W$,
which contains all the words of length $k + 1$ in the most compact way. It
follows that $R(N) = \{k + 1\}$.

As in the second part of the proof of Proposition 1, we consider
$N = q^{k+1} + k - r$ with $r \in \{1, 2, \ldots, q^{k+1} - q^k\}$ and the word
$W_N$ for which $K(N) = f_{W_N}(k+1) = q^{k+1} - r$. We have again
$k + 1 \in R(N)$. For $l > k + 1$, it is obvious that the complexity function
of $W_N$, or of any other word of length $N$, is strictly less than
$f_{W_N}(k+1)$. We examine now the possibility of finding a word $W$ with
$f_W(k+1) = N - k$ for which $f_W(l) = N - k$ for $l \le k$. We have
$f_W(l) \le q^l \le q^k \le q^{k+1} - r$, hence the equality
$f_W(l) = N - k = q^{k+1} - r$ holds only for $l = k$ and
$r = q^{k+1} - q^k$, that is for $N = q^k + k$. We show that for $N = q^k + k$
we have indeed $R(N) = \{k, k + 1\}$. If we start with Martin's word of
length $q^k + k - 1$ (or with another de Bruijn word) and add to this any
letter from $A$, we obtain obviously a word $V$ of length $N = q^k + k$, which
contains all the $q^k$ words of length $k$ and $q^k = N - k$ words of length
$k + 1$, hence $f_V(k) = f_V(k+1) = K(N)$.
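
Both cases of Proposition 2 can be illustrated numerically. In the sketch
below, `R_formula` is a hypothetical helper restricted to $N > q \ge 2$ (the
setting of this section), and the binary word 0011101000, which contains
each of the eight words of length 3 exactly once, plays the role of the de
Bruijn word that is extended by one letter.

```python
def R_formula(N, q):
    """R(N) as stated in Proposition 2, for N > q >= 2."""
    if not (q >= 2 and N > q):
        raise ValueError("this sketch covers only N > q >= 2")
    k = 1
    while q ** (k + 1) + k < N:
        k += 1
    return {k, k + 1} if N == q ** k + k else {k + 1}


def f(w, n):
    """Subword complexity f_w(n) of the string w."""
    return len({w[i:i + n] for i in range(len(w) - n + 1)})


# q = 2, k = 3: a de Bruijn word of length 2**3 + 3 - 1 = 10 ...
de_bruijn_word = "0011101000"
# ... extended by an arbitrary letter gives words V of length N = 2**3 + 3 = 11
extended = [de_bruijn_word + a for a in "01"]
```

Each word `V` in `extended` satisfies `f(V, 3) == f(V, 4) == 8`, which is the
value $K(11) = N - k$, in line with $R(11) = \{3, 4\}$.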

Remark 2. Having in mind the algorithm given by Martin [7] (or other
more efficient algorithms), words $w$ with maximal complexity
$C(w) = K(N)$ can be easily constructed for each $N$ and for both situations
in Proposition 2.

3 De Bruijn graphs and trees



In the previous section the global maximal complexity $K(N)$ for words
of length $N$ was obtained, as well as the set of points $R(N)$ where $K(N)$
is equal to the maximal value of the subword complexity of certain words
of length $N$. To this aim we used a special word constructed by Martin
[7], which is one of the de Bruijn words. A de Bruijn word for given $q$
and $k$ is a word over an alphabet with $q$ letters, containing all $k$-length
words exactly once. The length of such a word is $q^k + k - 1$.

In order to tackle the problem of finding the number of the words 
for which the global maximal complexity is attained, we shall use the de 
Bruijn graphs and trees. 

For a $q$-letter alphabet $A$ the de Bruijn graph is defined as:

\[ B(q,k) = (V(q,k), E(q,k)) \]

with $V(q,k) = A^k$ as the set of vertices, and $E(q,k) = A^{k+1}$ as the
set of directed arcs. There is an arc from $x_1 x_2 \ldots x_k$ to
$y_1 y_2 \ldots y_k$ if $x_2 x_3 \ldots x_k = y_1 y_2 \ldots y_{k-1}$, and this arc
is denoted by $x_1 x_2 \ldots x_k y_k$. See Fig. 1 and 2 for $B(2,2)$ and
$B(2,3)$. The de Bruijn graphs $B(q,k)$ are nonplanar for $k \ge 4$, $q \ge 2$.
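
The definition can be turned into an adjacency list directly. This is a
minimal sketch, assuming the letters are the digits $0, \ldots, q-1$ written
as characters (so $q \le 10$); the function name is illustrative.

```python
from itertools import product


def de_bruijn_graph(q, k):
    """Return B(q, k) as a dict mapping each vertex to its list of successors."""
    letters = [str(d) for d in range(q)]          # assumed encoding, q <= 10
    vertices = ["".join(t) for t in product(letters, repeat=k)]
    # the arc x_1...x_k y_k leads from x_1...x_k to x_2...x_k y_k
    return {v: [v[1:] + a for a in letters] for v in vertices}
```

In `de_bruijn_graph(2, 2)` the vertex 01 has the successors 10 and 11, and
the graph has $2^2 = 4$ vertices and $2^3 = 8$ arcs.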

In the de Bruijn graph $B(q,k)$ a path (i.e. a walk with distinct
vertices)

\[ a_1 a_2 \ldots a_k,\; a_2 a_3 \ldots a_{k+1},\; \ldots,\; a_{r-k+1} a_{r-k+2} \ldots a_r \quad (r > k) \]

corresponds to an $r$-length word $a_1 a_2 \ldots a_k a_{k+1} \ldots a_r$, which is
obtained by adding, in turn, to the vertex $a_1 a_2 \ldots a_k$ the last letter of
the following vertices in the path. For example in $B(2,3)$ the path 001,
010, 101 corresponds to the word 00101. Every maximal length path in the
graph $B(q,k)$ (which is a Hamiltonian one) corresponds to a de Bruijn word.
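
The correspondence between paths and words described above can be sketched
as follows, with vertices given as strings of equal length $k$; both
function names are hypothetical.

```python
def path_to_word(path):
    """Spell the word obtained from a path of k-letter vertices in B(q, k)."""
    if len(set(path)) != len(path):
        raise ValueError("the vertices of a path must be distinct")
    for u, v in zip(path, path[1:]):
        if u[1:] != v[:-1]:
            raise ValueError(f"there is no arc from {u} to {v}")
    # first vertex, followed by the last letter of every later vertex
    return path[0] + "".join(v[-1] for v in path[1:])


def word_to_vertices(word, k):
    """List the successive k-letter factors of word (a walk in B(q, k))."""
    return [word[i:i + k] for i in range(len(word) - k + 1)]
```

With these helpers, `path_to_word(["001", "010", "101"])` gives `"00101"`,
and `word_to_vertices("00101", 3)` recovers the path.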




Fig. 1: The de Bruijn graph $B(2,2)$.

In the directed graph $B(q,k)$ there always exists an Eulerian circuit
because it is connected and all its vertices have the same indegree and
outdegree $q$. An Eulerian circuit in $B(q,k)$ is a Hamiltonian path in
$B(q,k+1)$ (which always can be continued in a Hamiltonian cycle). For
example in $B(2,2)$ the following walk: 000, 001, 010, 101, 011, 111, 110,
100 represents an Eulerian circuit, which in $B(2,3)$ is a Hamiltonian
path.
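
The example walk can be checked mechanically. The two predicates below are
a sketch written for this illustration: the first treats the words as arcs
of $B(q,k)$, the second treats the same words as vertices of $B(q,k+1)$.
Both assume that the words are strings over a $q$-letter alphabet.

```python
def is_eulerian_circuit(arcs, q, k):
    """Check that arcs (words of length k + 1) form an Eulerian circuit of B(q, k)."""
    if any(len(a) != k + 1 for a in arcs):
        return False
    if len(arcs) != q ** (k + 1) or len(set(arcs)) != len(arcs):
        return False                      # every arc must be used exactly once
    closed = arcs + arcs[:1]              # return to the first arc at the end
    # the arc x_1...x_{k+1} ends in x_2...x_{k+1}, where the next arc must start
    return all(a[1:] == b[:-1] for a, b in zip(closed, closed[1:]))


def is_hamiltonian_path(vertices, q, k):
    """Check that vertices (words of length k) form a Hamiltonian path of B(q, k)."""
    if any(len(v) != k for v in vertices):
        return False
    if len(vertices) != q ** k or len(set(vertices)) != len(vertices):
        return False                      # every vertex must be visited once
    return all(u[1:] == v[:-1] for u, v in zip(vertices, vertices[1:]))
```

For `walk = ["000", "001", "010", "101", "011", "111", "110", "100"]` both
`is_eulerian_circuit(walk, 2, 2)` and `is_hamiltonian_path(walk, 2, 3)`
hold.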




Fig. 2: The de Bruijn graph B(2,3). 

In order to study the number of words in $A^N$ which have the maximal
complexity equal to the global maximal complexity $K(N)$ we shall
introduce the so-called de Bruijn trees. A de Bruijn tree $T(q,w)$ with
the root $w \in A^k$ is a $q$-ary tree defined recursively as follows:

i. The $k$-length word $w$ over the alphabet $A = \{e_1, e_2, \ldots, e_q\}$ is
the root of $T(q,w)$.

ii. If at any step of the recursive construction of the tree,
$x_1 x_2 \ldots x_k$ is a (temporary) leaf (a vertex with outdegree equal to 0),
then each word among $x_2 x_3 \ldots x_k e_1$, $x_2 x_3 \ldots x_k e_2$, $\ldots$,
$x_2 x_3 \ldots x_k e_q$ which is not in the path from the root to
$x_1 x_2 \ldots x_k$ will be a descendant of $x_1 x_2 \ldots x_k$.

iii. Rule ii is applied as many times as possible.

A path is maximal if we cannot add an arc to its beginning or to 
its end without destroying the path property. If a maximal path is of 
maximal length then it is a Hamiltonian one. In any de Bruijn tree each 
branch is a maximal path in the de Bruijn graph B(q, k) which begins 
with the root, and all maximal paths beginning with the root occur. For 
the de Bruijn trees $T(2,000)$, $T(2,001)$, $T(2,010)$ and $T(2,100)$ see Fig.
3-6. The word obtained by Martin's algorithm corresponds to the branch
of maximal length in the right side of the de Bruijn tree $T(2,001)$.
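
The recursive definition of $T(q,w)$ can be explored level by level. In the
sketch below each tree vertex is stored together with its path from the
root, which is exactly the information rule ii needs; the function names
and the digit encoding of the letters are assumptions of this illustration.

```python
from itertools import product


def tree_level_counts(root, letters):
    """Return [n_0, n_1, ...]: the number of vertices on each level of T(q, root)."""
    counts = []
    frontier = [(root,)]                  # paths from the root to the current level
    while frontier:
        counts.append(len(frontier))
        frontier = [
            path + (path[-1][1:] + a,)
            for path in frontier
            for a in letters
            if path[-1][1:] + a not in path   # rule ii: skip words already on the path
        ]
    return counts


def vertices_at_level(q, k, level):
    """Total number of vertices on a given level over all trees T(q, w), w in A^k."""
    letters = [str(d) for d in range(q)]
    total = 0
    for t in product(letters, repeat=k):
        counts = tree_level_counts("".join(t), letters)
        if level < len(counts):
            total += counts[level]
    return total
```

For instance, $T(2,000)$ has 1, 1, 2 and 4 vertices on levels 0 to 3, and the
eight trees with roots of length 3 have 36 vertices on level 3 in total.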

4 Methods to compute M(N) 

The number $M(N)$ of the words of length $N$ for which the maximal
complexity is equal to the global maximal complexity $K(N)$ can be expressed
both in terms of certain paths in a de Bruijn graph and of some vertices
in the de Bruijn trees.

Fig. 4: De Bruijn tree $T(2,001)$.

Proposition 3. If $\mathrm{Card}(A) = q$ and $q^k + k \le N \le q^{k+1} + k$ then
$M(N)$ is equal to the number of different paths of length $N - k - 1$ in the
de Bruijn graph $B(q, k+1)$.

Proof. From Propositions 1 and 2 it follows that the number $M(N)$ of
the words of length $N$ with global maximal complexity is given by the
number of words $w \in A^N$ with $f_w(k+1) = N - k$. It means that these
words contain $N - k$ subwords of length $k + 1$, all of them distinct. To
enumerate all of them we start successively with each word of $k + 1$ letters
(hence with each vertex in $B(q, k+1)$) and we add at each step, in turn,
one of the symbols from $A$ which does not duplicate a word of length
$k + 1$ which has already appeared. Of course, not all of the trials will
finish in a word of length $N$, but those which do are precisely paths
in $B(q, k+1)$ starting with each vertex in turn and having the length
$N - k - 1$. Hence to each word of length $N$ with $f_w(k+1) = N - k$
we can associate one and only one path of length $N - k - 1$ starting from
the vertex given by the first $k + 1$ letters of the initial word; conversely,
any path of length $N - k - 1$ will provide a word $w$ of length $N$ which
contains $N - k$ distinct subwords of length $k + 1$.
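
The enumeration used in the proof is a depth-first search over the arcs of
$B(q, k+1)$. The sketch below counts the paths instead of listing them;
`count_paths` and `M_via_paths` are hypothetical names, letters are encoded
as digits (so $q \le 10$), and the running time is exponential in general.

```python
from itertools import product


def count_paths(q, k, length):
    """Number of paths (walks with distinct vertices) with `length` arcs in B(q, k)."""
    letters = [str(d) for d in range(q)]

    def extend(last, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for a in letters:
            nxt = last[1:] + a
            if nxt not in visited:        # keep the vertices distinct
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    starts = ("".join(t) for t in product(letters, repeat=k))
    return sum(extend(v, {v}, length) for v in starts)


def M_via_paths(N, q):
    """M(N) computed as in Proposition 3: paths of length N - k - 1 in B(q, k + 1)."""
    k = 0
    while q ** (k + 1) + k < N:           # find k with q**k + k <= N <= q**(k+1) + k
        k += 1
    return count_paths(q, k + 1, N - k - 1)
```

For the binary alphabet this gives `M_via_paths(3, 2) == 6` and
`M_via_paths(6, 2) == 36`.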

Remark 3. The number of words of length $N$ having global maximal
complexity can be also expressed by means of certain vertices in the
de Bruijn trees. $M(N)$ is equal to the number of vertices at the level
$N - k - 1$ in the set $\{T(q,w) \mid w \in A^{k+1}\}$ of the de Bruijn trees.
(The level of the root is considered to be 0, its descendants are on level 1
etc.)

The other four trees corresponding to the de Bruijn graph $B(2,3)$
are mirror images of those in Fig. 3-6; we obtain, for example, $M(6)$
by doubling the number of vertices at level 3 in Fig. 3-6, i.e.
$M(6) = 2 \cdot 18 = 36$. Similarly $M(7) = 2 \cdot 21 = 42$ is obtained by
doubling the number of vertices at level 4, and so on up to
$M(10) = 2 \cdot 8 = 16$ (using the vertices at level 7). These results are in
accordance with those given in Table 1 obtained by counting all possible
words with maximal complexity.

A formula for the number M(N) of the words whose maximal com- 
plexity is equal to the global maximal complexity K(N) can be given for 
the special case of de Bruijn words. 

Proposition 4. If $N = 2^k + k - 1$ then $M(N) = 2^{2^{k-1}}$.

Proof. The number of distinct Hamiltonian cycles in the de Bruijn
graph $B(2,k)$ is equal to $2^{2^{k-1} - k}$ [2]. With each vertex of a
Hamiltonian cycle a de Bruijn word (containing all the factors of length $k$)
begins, which has maximal complexity, so
$M(N) = 2^k \cdot 2^{2^{k-1} - k} = 2^{2^{k-1}}$, which proves
the proposition. (In [3] the number of circular de Bruijn words is found,
which corresponds to the number of Hamiltonian cycles in de Bruijn
graphs).

A generalization for $q \ge 2$ can be proved in a similar way using the
results in pQ. 

Proposition 5. If $N = q^k + k - 1$ then $M(N) = (q!)^{q^{k-1}}$.
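
For small parameters the closed form can be compared with a direct count of
the Hamiltonian paths of $B(q,k)$, each of which spells one word of length
$q^k + k - 1$ with all factors of length $k$ distinct. The following is only
a numerical sanity check under the digit encoding assumed here, not a
proof; both function names are illustrative.

```python
from itertools import product
from math import factorial


def M_closed_form(q, k):
    """The value (q!)**(q**(k - 1)) for N = q**k + k - 1."""
    return factorial(q) ** (q ** (k - 1))


def count_hamiltonian_paths(q, k):
    """Count the Hamiltonian paths of B(q, k) by exhaustive search (small cases only)."""
    letters = [str(d) for d in range(q)]
    n_vertices = q ** k

    def extend(last, visited):
        if len(visited) == n_vertices:
            return 1
        total = 0
        for a in letters:
            nxt = last[1:] + a
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited)
                visited.remove(nxt)
        return total

    starts = ("".join(t) for t in product(letters, repeat=k))
    return sum(extend(v, {v}) for v in starts)
```

For $q = 2$, $k = 3$ both functions give 16, the value of $M(10)$; for
$q = 3$, $k = 2$ both give $6^3 = 216$.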

In Proposition 1, respectively Proposition 2, we have determined
for each natural number $N$ the value of the global maximal complexity
$K(N)$, respectively the set of values $i$ for which there exists a word of
length $N$ with $K(N)$ subwords of length $i$. To obtain a general formula
for $M(N)$ for each natural number $N$ is still an open problem.